23 research outputs found

    Model Selection in Data Analysis Competitions

    Abstract. The use of data analysis competitions for selecting the most appropriate model for a problem is a recent innovation in the field of predictive machine learning. Two of the most well-known examples of this trend were the Netflix Competition and, more recently, the competitions hosted on the online platform Kaggle. In this paper, we will state and try to verify a set of qualitative hypotheses about predictive modelling, both in general and in the scope of data analysis competitions. To verify our hypotheses we will look at previous competitions and their outcomes, use qualitative interviews with top performers from Kaggle and use previous personal experiences from competing in Kaggle competitions. The stated hypotheses about feature engineering, ensembling, overfitting, model complexity and evaluation metrics give indications and guidelines on how to select a proper model for performing well in a competition on Kaggle.

    Inferring Person-to-person Proximity Using WiFi Signals

    Today's societies are enveloped in an ever-growing telecommunication infrastructure. This infrastructure offers important opportunities for sensing and recording a multitude of human behaviors. Human mobility patterns are a prominent example of such a behavior, which has been studied using cell phone towers, Bluetooth beacons, and WiFi networks as proxies for location. However, while mobility is an important aspect of human behavior, understanding complex social systems requires studying not only the movement of individuals, but also their interactions. Sensing social interactions on a large scale is a technical challenge, and many commonly used approaches, including RFID badges or Bluetooth scanning, offer only limited scalability. Here we show that it is possible, in a scalable and robust way, to accurately infer person-to-person physical proximity from the lists of WiFi access points measured by smartphones carried by the two individuals. Based on a longitudinal dataset of approximately 800 participants with ground-truth interactions collected over a year, we show that our model performs better than the current state-of-the-art. Our results demonstrate the value of WiFi signals in social sensing as well as potential threats to privacy that they imply.
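The abstract does not describe the model itself. As a purely illustrative sketch of the underlying idea (comparing the access-point lists seen by two phones at about the same time), the snippet below scores two scans by their Jaccard overlap. The function names, the use of Jaccard similarity, and the 0.5 threshold are all assumptions made for this example, not the authors' method.

```python
def ap_overlap(scan_a, scan_b):
    """Jaccard similarity between two collections of access-point identifiers."""
    a, b = set(scan_a), set(scan_b)
    if not a and not b:
        return 0.0  # no evidence either way when both scans are empty
    return len(a & b) / len(a | b)


def likely_proximate(scan_a, scan_b, threshold=0.5):
    """Hypothetical decision rule: call two devices 'close' if overlap is high.

    The threshold is an arbitrary placeholder, not a value from the paper.
    """
    return ap_overlap(scan_a, scan_b) >= threshold


if __name__ == "__main__":
    phone_1 = ["ap:01", "ap:02", "ap:03"]
    phone_2 = ["ap:02", "ap:03", "ap:04"]
    print(ap_overlap(phone_1, phone_2))        # 2 shared out of 4 distinct
    print(likely_proximate(phone_1, phone_2))
```

A real system would presumably also use signal strengths and time alignment between scans; this sketch ignores both.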

    On the number of spanning trees in random regular graphs

    Let $d \geq 3$ be a fixed integer. We give an asymptotic formula for the expected number of spanning trees in a uniformly random $d$-regular graph with $n$ vertices. (The asymptotics are as $n \to \infty$, restricted to even $n$ if $d$ is odd.) We also obtain the asymptotic distribution of the number of spanning trees in a uniformly random cubic graph, and conjecture that the corresponding result holds for arbitrary (fixed) $d$. Numerical evidence is presented which supports our conjecture. Comment: 26 pages, 1 figure. To appear in the Electronic Journal of Combinatorics. This version addresses referee's comment
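For readers who want to experiment with the quantity being studied, the number of spanning trees of a specific graph can be computed exactly with Kirchhoff's matrix-tree theorem: it equals any cofactor of the graph Laplacian. The sketch below is not taken from the paper; it is a small exact-arithmetic implementation (function name and graph encoding are assumptions) that can be checked against a few small regular graphs.

```python
from fractions import Fraction


def count_spanning_trees(n, edges):
    """Count spanning trees of a graph on vertices 0..n-1 via the matrix-tree theorem."""
    # Build the Laplacian L = D - A with exact rational entries.
    lap = [[Fraction(0)] * n for _ in range(n)]
    for u, v in edges:
        lap[u][u] += 1
        lap[v][v] += 1
        lap[u][v] -= 1
        lap[v][u] -= 1

    # Delete the last row and column, then take the determinant
    # by Gaussian elimination (exact, thanks to Fraction).
    m = [row[:-1] for row in lap[:-1]]
    size = n - 1
    det = Fraction(1)
    for col in range(size):
        pivot = next((r for r in range(col, size) if m[r][col] != 0), None)
        if pivot is None:
            return 0  # singular minor: the graph is disconnected
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for c in range(col, size):
                m[r][c] -= factor * m[col][c]
    return int(det)


if __name__ == "__main__":
    # Petersen graph (3-regular, 10 vertices): outer 5-cycle, inner pentagram, spokes.
    petersen = (
        [(i, (i + 1) % 5) for i in range(5)]
        + [(5 + i, 5 + (i + 2) % 5) for i in range(5)]
        + [(i, 5 + i) for i in range(5)]
    )
    print(count_spanning_trees(10, petersen))
```

Averaging this count over many sampled $d$-regular graphs would give a numerical estimate to compare with an asymptotic formula; generating uniformly random regular graphs is a separate problem not covered here.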

    Inferring Stop-Locations from WiFi

    Human mobility patterns are inherently complex. In terms of understanding these patterns, the process of converting raw data into series of stop-locations and transitions is an important first step which greatly reduces the volume of data, thus simplifying the subsequent analyses. Previous research into the mobility of individuals has focused on inferring 'stop locations' (places of stationarity) from GPS or CDR data, or on detection of state (static/active). In this paper we bridge the gap between the two approaches: we introduce methods for detecting both mobility state and stop-locations. In addition, our methods are based exclusively on WiFi data. We study two months of WiFi data collected every two minutes by a smartphone, and infer stop-locations in the form of labelled time-intervals. For this purpose, we investigate two algorithms, both of which scale to large datasets: a greedy approach to select the most important routers and one which uses a density-based clustering algorithm to detect router fingerprints. We validate our results using participants' GPS data as well as ground truth data collected during a two-month period.

    String Matching with Variable Length Gaps

    We consider string matching with variable length gaps. Given a string $T$ and a pattern $P$ consisting of strings separated by variable length gaps (arbitrary strings of length in a specified range), the problem is to find all ending positions of substrings in $T$ that match $P$. This problem is a basic primitive in computational biology applications. Let $m$ and $n$ be the lengths of $P$ and $T$, respectively, and let $k$ be the number of strings in $P$. We present a new algorithm achieving time $O(n \log k + m + \alpha)$ and space $O(m + A)$, where $A$ is the sum of the lower bounds of the lengths of the gaps in $P$ and $\alpha$ is the total number of occurrences of the strings in $P$ within $T$. Compared to the previous results this bound essentially achieves the best known time and space complexities simultaneously. Consequently, our algorithm obtains the best known bounds for almost all combinations of $m$, $n$, $k$, $A$, and $\alpha$. Our algorithm is surprisingly simple and straightforward to implement. We also present algorithms for finding and encoding the positions of all strings in $P$ for every match of the pattern. Comment: draft of full version, extended abstract at SPIRE 201
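To make the problem statement concrete, here is a deliberately naive baseline. It is not the algorithm from the paper and does not achieve the stated bounds. The encoding of the pattern as a list of non-empty strings plus a list of inclusive `(lo, hi)` gap ranges, and the choice to report exclusive end indices, are assumptions made for this sketch.

```python
def find_occurrences(text, word):
    """Return all start indices at which `word` occurs in `text`."""
    return [i for i in range(len(text) - len(word) + 1) if text.startswith(word, i)]


def match_with_gaps(text, strings, gaps):
    """Return sorted exclusive end indices of substrings of `text` matching the pattern.

    `strings` is a non-empty list of non-empty strings; `gaps[i]` is the
    inclusive (lo, hi) length range of the gap between strings[i] and strings[i + 1].
    """
    if len(gaps) != len(strings) - 1:
        raise ValueError("need exactly one gap range between consecutive strings")

    # End positions reachable after matching only the first string.
    ends = {s + len(strings[0]) for s in find_occurrences(text, strings[0])}

    for word, (lo, hi) in zip(strings[1:], gaps):
        next_ends = set()
        for s in find_occurrences(text, word):
            # This occurrence extends a partial match if some earlier end
            # lies between lo and hi characters before its start.
            if any((s - g) in ends for g in range(lo, hi + 1)):
                next_ends.add(s + len(word))
        ends = next_ends
    return sorted(ends)


if __name__ == "__main__":
    print(match_with_gaps("abxxcdabcd", ["ab", "cd"], [(0, 2)]))
```

Each occurrence is checked against every allowed gap length, so the running time grows with the width of the gap ranges; avoiding that dependence is precisely what more refined algorithms aim for.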

    Optimisation of Car Park Designs

    The problem presented by ARUP to the UK Study Group 2014 was to investigate methods for maximising the number of car parking spaces that can be placed within a car park. This is particularly important for basement car parks in residential apartment blocks or offices where parking spaces command a high value. Currently the job of allocating spaces is done manually and is very time intensive. The Study Group working on this problem split into teams examining different aspects of the car park design process. There were three approaches taken. These approaches include a so-called "tile-and-trim" method in which an optimal layout of cars from an `infinite car park' is overlaid onto the actual car park domain; adjustments are then made to accommodate access from one lane to the next. A second approach seeks to develop an algorithm for optimising the road within a car park on the assumption that car parking spaces should fill the space and that any space needs to be adjacent to the network. A third, similar approach focused on schemes for assessing the potential capacity of a small selection of specified road networks within the car park to assist the architect in selecting the optimal road network(s). The problem is a variant of the "bin packing" problem, well known in computer science. It is further complicated by the fact that two different classes of item need to be packed (roads and cars), with both local (immediate access to a road) and global (connectivity of the road network) constraints. Bin-packing is known to be NP-hard, and hence the problem at hand has at least this level of computational complexity. None of the approaches produced a complete solution to the problem posed. Indeed, it was quickly determined by the group that this was a very hard problem (a view reinforced by the many different possible approaches considered) requiring far longer than a week to really make significant progress.
All approaches rely to differing degrees on optimisation algorithms, which are inherently unreliable unless designed specifically for the intended purpose. It is also not clear whether a relatively simple automated computer algorithm will be able to "beat the eye of the architect"; additional sophistication may be required due to subtle constraints. Apart from determining that the problem is hard, positive outcomes have included: determining that parking perpendicular to the road in long aisles provides the most efficient packing of cars; provision of code which "tiles and trims" from an infinite car park onto the given car park, with interactive feedback on the number of cars in the packing; provision of code for optimal packing in a parallel-walled car park; methods for optimising a road within a given domain, based on developing cost functions ensuring that cars fill the car park and have access to the road; provision of code for optimising a single road in a given (square) space; a description of methods for assessing the capacity of a car park for a given set of road networks, in order to select optimal road networks; and some ideas for developing possible solutions further.

    RUNTIME DICTIONARIES FOR ROOT

    ROOT is the LHC physicists' common tool for data analysis; almost all data is stored using ROOT's I/O system. This system benefits from a custom description of types (a so-called dictionary) that is optimised for the I/O. Currently, the dictionary cannot be provided at run-time; it needs to be prepared in a separate prerequisite step. This project will move the generation of the dictionary to run-time, making use of ROOT 6's new just-in-time compiler. This allows more dynamic and natural access to ROOT's I/O features, especially for user code.